In [5]:
import pandas as pd
names = pd.DataFrame({"name" : ["Alice","Bob","Charlie","Dennis"],
"surname" : ["Doe","Smith","Sheen","Quaid"]})
names
Out[5]:
In [4]:
names.name.str.match(r"A\w+")
Out[4]:
In [7]:
debts = pd.DataFrame({"debtor":["D.Quaid","C.Sheen"],
"amount":[100,10000]})
debts
Out[7]:
Imagine I want a list of my friends with the amount of money I lent to each of them, together with their names and surnames.
In [10]:
debts["surname"] = debts.debtor.str.extract(r"\w+\.(\w+)")
debts
Out[10]:
To merge two dataframes we can use the merge function: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html. It merges DataFrame objects by performing a database-style join operation on columns or indexes.
We can use merge in two different (but very similar) ways:
Common parameters
on : label or list
Field names to join on. Must be found in both DataFrames. If on is None and not merging on indexes, then it merges on the intersection of the columns by default.
left_on : label or list, or array-like
Field names to join on in left DataFrame. Can be a vector or list of vectors of the length of the DataFrame to use a particular vector as the join key instead of columns
right_on : label or list, or array-like
Field names to join on in right DataFrame or vector/list of vectors per left_on docs
left_index : boolean, default False
Use the index from the left DataFrame as the join key(s). If it is a MultiIndex, the number of keys in the other DataFrame (either the index or a number of columns) must match the number of levels
right_index : boolean, default False
Use the index from the right DataFrame as the join key. Same caveats as left_index
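As a quick sanity check of the parameters above, here is a minimal sketch (with made-up frames, not the names/debts data) showing that when the key column has the same label in both frames, `on=` is simply shorthand for passing that label to both `left_on` and `right_on`:

```python
import pandas as pd

left = pd.DataFrame({"surname": ["Doe", "Smith"], "name": ["Alice", "Bob"]})
right = pd.DataFrame({"surname": ["Smith", "Quaid"], "amount": [10, 20]})

# When the key column has the same label in both frames,
# on="surname" is equivalent to left_on="surname", right_on="surname".
via_on = left.merge(right, on="surname", how="inner")
via_left_right = left.merge(right, left_on="surname", right_on="surname",
                            how="inner")
```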
In [12]:
names.merge(debts, left_on="surname", right_on="surname", how="left")
Out[12]:
In [13]:
names.merge(debts, left_on="surname", right_on="surname", how="right")
Out[13]:
In [14]:
names.merge(debts, left_on="surname", right_on="surname", how="inner")
Out[14]:
In [15]:
names.merge(debts, left_on="surname", right_on="surname", how="outer")
Out[15]:
In [17]:
names.merge(debts, left_index=True, right_index=True, how="left")
Out[17]:
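A useful companion to `how=` (not used in the cells above, so treat this as an aside) is merge's `indicator` parameter, which adds a `_merge` column recording where each row came from. This makes it easy to audit which rows failed to match:

```python
import pandas as pd

names = pd.DataFrame({"name": ["Alice", "Bob", "Charlie", "Dennis"],
                      "surname": ["Doe", "Smith", "Sheen", "Quaid"]})
debts = pd.DataFrame({"surname": ["Quaid", "Sheen"], "amount": [100, 10000]})

# indicator=True tags each row as "left_only", "right_only" or "both"
merged = names.merge(debts, on="surname", how="outer", indicator=True)
unmatched = merged[merged["_merge"] == "left_only"]
```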
In [20]:
lat_lon_mun = pd.read_excel("lat_lon_municipalities.xls", skiprows=2)
lat_lon_mun.head()
Out[20]:
In [27]:
mun_codes = pd.read_excel("11codmun.xls", encoding="latin1", skiprows=1)
mun_codes.head()
Out[27]:
In [33]:
lat_lon_mun[lat_lon_mun["Población"].str.match(".*anaria.*")]
Out[33]:
In [34]:
mun_codes[mun_codes["NOMBRE"].str.match(".*anaria.*")]
Out[34]:
In [37]:
"Valsequillo de Gran Canaria" == "Valsequillo de Gran Canaria"
Out[37]:
In [36]:
"Palmas de Gran Canaria (Las)" == "Palmas de Gran Canaria, Las"
Out[36]:
The closeness of a match is often measured in terms of edit distance, which is the number of primitive operations necessary to convert the string into an exact match. Primitive operations are usually: insertion (to insert a new character at a given position), deletion (to delete a particular character) and substitution (to replace a character with a new one).
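The three primitive operations translate directly into a short dynamic-programming implementation of edit distance (a didactic sketch, not what FuzzyWuzzy uses internally):

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to turn string a into string b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]
```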
Fuzzy string matching has many practical applications. Typical examples are spell-checking, text re-use detection (a politically correct name for plagiarism detection) and spam filtering, as well as several applications in bioinformatics, e.g. matching DNA sequences.
The main modules in FuzzyWuzzy are called fuzz, for string-to-string comparisons, and process to compare a string with a list of strings.
Under the hood, FuzzyWuzzy uses difflib, part of the standard library, so there is nothing extra to install.
The simplest way to compare two strings is with a measurement of edit distance. For example, the following two strings are quite similar:
NEW YORK METS
NEW YORK MEATS
Now, according to the ratio:
Return a measure of the sequences' similarity as a float in the range [0, 1]. Where T is the total number of elements in both sequences, and M is the number of matches, this is 2.0*M / T.
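That description is difflib's SequenceMatcher docstring; FuzzyWuzzy just rescales the result to an integer between 0 and 100. We can verify the 2.0*M / T formula by hand:

```python
from difflib import SequenceMatcher

a, b = "NEW YORK METS", "NEW YORK MEATS"
sm = SequenceMatcher(None, a, b)

# M = total size of the matching blocks, T = combined length of both strings
m = sum(block.size for block in sm.get_matching_blocks())
t = len(a) + len(b)

by_hand = 2.0 * m / t      # the 2.0*M / T formula from the docstring
by_difflib = sm.ratio()
```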
In [28]:
from fuzzywuzzy import fuzz
fuzz.ratio("NEW YORK METS","NEW YORK MEATS")
Out[28]:
In [35]:
fuzz.ratio("Palmas de Gran Canaria (Las)","Palmas de Gran Canaria, Las")
Out[35]:
In [39]:
"San Millán de Yécora" == "Millán de Yécora"
Out[39]:
In [40]:
fuzz.ratio("San Millán de Yécora", "Millán de Yécora")
Out[40]:
In fact we can have the following situation:
In [42]:
fuzz.ratio("YANKEES", "NEW YORK YANKEES")
Out[42]:
In [44]:
fuzz.ratio("NEW YORK METS", "NEW YORK YANKEES")
Out[44]:
partial_ratio looks for the best-matching substring and returns its ratio:
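The idea can be sketched in a few lines of pure difflib (a simplification, not FuzzyWuzzy's exact implementation): slide the shorter string over the longer one and keep the best score among same-length windows.

```python
from difflib import SequenceMatcher

def partial_ratio_sketch(s1, s2):
    """Rough sketch of the partial-ratio idea: slide the shorter string
    over the longer one and keep the best same-length-window ratio."""
    shorter, longer = sorted((s1, s2), key=len)
    best = 0.0
    for i in range(len(longer) - len(shorter) + 1):
        window = longer[i:i + len(shorter)]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return round(best * 100)
```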
In [45]:
fuzz.partial_ratio("San Millán de Yécora", "Millán de Yécora")
Out[45]:
In [46]:
fuzz.partial_ratio("YANKEES", "NEW YORK YANKEES")
Out[46]:
In [47]:
fuzz.partial_ratio("NEW YORK METS", "NEW YORK YANKEES")
Out[47]:
In [48]:
s1 = "Las Palmas de Gran Canaria"
s2 = "Gran Canaria, Las Palmas de"
s3 = "Palmas de Gran Canaria, Las"
s4 = "Palmas de Gran Canaria, (Las)"
In [49]:
fuzz.token_sort_ratio("Las Palmas de Gran Canaria", "Palmas de Gran Canaria Las")
Out[49]:
In [50]:
fuzz.ratio("Las Palmas de Gran Canaria", "Palmas de Gran Canaria Las")
Out[50]:
The token set approach is similar, but a little bit more flexible. Here, we tokenize both strings, but instead of immediately sorting and comparing, we split the tokens into two groups: intersection and remainder. We use those sets to build up a comparison string.
t0 = [SORTED_INTERSECTION]
t1 = [SORTED_INTERSECTION] + [SORTED_REST_OF_STRING1]
t2 = [SORTED_INTERSECTION] + [SORTED_REST_OF_STRING2]
max(ratio(t0,t1),ratio(t0,t2),ratio(t1,t2))
In [52]:
t0 = ["Canaria,","de","Gran", "Palmas"]
t1 = ["Canaria,","de","Gran", "Palmas"] + ["Las"]
t2 = ["Canaria,","de","Gran", "Palmas"] + ["(Las)"]
In [53]:
fuzz.token_sort_ratio("Palmas de Gran Canaria, Las", "Palmas de Gran Canaria, (Las)")
Out[53]:
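The t0/t1/t2 recipe above can be coded directly. This is a simplified sketch on raw whitespace tokens; the real fuzz.token_set_ratio also lowercases and strips punctuation before tokenizing.

```python
from difflib import SequenceMatcher

def ratio100(a, b):
    return round(100 * SequenceMatcher(None, a, b).ratio())

def token_set_sketch(s1, s2):
    """Sketch of the token-set comparison: split tokens into the
    intersection and the per-string remainders, then take the best
    pairwise ratio among t0, t1 and t2."""
    tokens1, tokens2 = set(s1.split()), set(s2.split())
    inter = sorted(tokens1 & tokens2)
    t0 = " ".join(inter)
    t1 = " ".join(inter + sorted(tokens1 - tokens2))
    t2 = " ".join(inter + sorted(tokens2 - tokens1))
    return max(ratio100(t0, t1), ratio100(t0, t2), ratio100(t1, t2))
```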
We want to merge mun_codes and lat_lon_mun, so we need a consistent municipality name in both datasets. Starting from those names we can do:
In [54]:
mun_codes.shape
Out[54]:
In [56]:
mun_codes.head()
Out[56]:
In [55]:
lat_lon_mun.shape
Out[55]:
In [57]:
lat_lon_mun.head()
Out[57]:
In [111]:
df1 = mun_codes.merge(lat_lon_mun, left_on="NOMBRE", right_on="Población",how="inner")
df1.head()
Out[111]:
In [112]:
df1["match_ratio"] = 100
In [113]:
df2 = mun_codes.merge(df1, left_on="NOMBRE", right_on="NOMBRE", how="left")
df2.head()
df3 = df2.loc[: ,["CPRO_x","CMUN_x","DC_x","NOMBRE","match_ratio"]]
df3.rename(columns={"CPRO_x": "CPRO", "CMUN_x":"CMUN","DC_x":"DC"},inplace=True)
df3.head()
Out[113]:
In [115]:
df3.loc[df3.match_ratio.isnull(),:].head()
Out[115]:
In [127]:
mun_names = lat_lon_mun["Población"].tolist()

def approx_str_compare(x):
    # Score x against every municipality name and return the best match
    ratio = [fuzz.ratio(x, m) for m in mun_names]
    res = pd.DataFrame({"ratio": ratio,
                        "name": mun_names})
    return res.sort_values(by="ratio", ascending=False).iloc[0, :]

df4 = df3.loc[df3.match_ratio.isnull(), "NOMBRE"].map(approx_str_compare)
In [135]:
df4.map(lambda x: x["name"])
df4.map(lambda x: x["ratio"])
df6 = df3.loc[df3.match_ratio.isnull(), :].copy()  # .copy() avoids SettingWithCopyWarning
df6["match_ratio"] = df4.map(lambda x: x["ratio"])
df6["NOMBRE"] = df4.map(lambda x: x["name"])
df6.head()
Out[135]:
In [138]:
df7 = pd.concat([df3,df6])
Improve the pipeline, taking into account that we haven't checked that all names are unique. What is happening? Can you solve it?
Work with both NOMBRE and Población variables to improve results.
Make the approximate string matching procedure more efficient.
Provide the final merged dataset.
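As a starting point for the efficiency exercise: difflib.get_close_matches (already in the standard library) prunes clearly poor candidates with cheap upper bounds before computing full ratios, instead of scoring every name and building a DataFrame per query as approx_str_compare does. A sketch with a hypothetical three-name list:

```python
import difflib

mun_names = ["Las Palmas de Gran Canaria", "San Millán de Yécora",
             "Valsequillo de Gran Canaria"]

# get_close_matches discards candidates whose quick upper-bound ratio
# falls below cutoff, so most full comparisons are never computed.
best = difflib.get_close_matches("Millán de Yécora", mun_names,
                                 n=1, cutoff=0.6)
```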